Introduction

Every day, it is estimated that the world produces roughly 2.5 quintillion bytes of information, a large fraction of which is text data. These data come in the form of invoices, contracts, proposals, news articles, scientific journals, and ebooks, among countless other sources. With such massive volumes of text being collected all the time, it is often impossible for a human to parse through all of it manually in an efficient way. As such, there is a great need for automated text processing methods to tag and cluster documents, extract meaning from them, and, in some cases, identify their sentiment.

The ability to process text automatically is especially important in the business setting, where product reviews and customer feedback are extremely valuable in helping shape a brand or product to better fit consumer needs. According to one frequently cited survey, approximately 80% of potential customers trust online reviews as much as personal recommendations from friends or family members, and a whopping 88% of them incorporate product reviews into their decision of whether or not they will ultimately purchase an item. Being able to quickly weed through and process thousands of comments and reviews to find out what people are saying about a product can give a business a significant advantage over its competitors.

In this project we'll check out some of the most commonly used topic modeling and document clustering methods employed today. These include techniques like $n$-gram analysis, part-of-speech tagging, term frequency$-$inverse document frequency, $k$-means clustering, and latent Dirichlet allocation. The text data used for this project consist of news article descriptions scraped from various sources around the world like The Guardian, Bloomberg, and the Associated Press, among many others. This project was heavily influenced by the excellent data science blog post written by Ahmed Besbes, which can be found here.

Data Collection

To get the news article data that we'll use in this project, we can make use of the handy newsapi.org API, which happens to be very simple to use. On the newsapi homepage, one will see a "GET API KEY" link in the upper right-hand corner. After clicking this link, you'll be asked to provide an email address and a password...and that's all there is to it! Well, almost. You'll be presented with something called an "API key", which is a long series of numbers and letters. This is sort of like a password and is unique to each individual user. To try the API, simply type:

https://newsapi.org/v1/articles?source={NEWSSOURCE}&apiKey={APIKEY}

in the search bar. Before doing so, however, replace {APIKEY} with your own API key, and replace {NEWSSOURCE} with a news source like Bloomberg. So our link would then become something like:

https://newsapi.org/v1/articles?source=bloomberg&apiKey=633qw20c29984ce08504f5c89c2cmm02

Note that the particular API key in the above example does not actually work $-$ it's just used to demonstrate how this would be done in practice. The output looks something like this:

In [32]:
import json
import requests

site = 'https://newsapi.org/v1/articles?source=bloomberg&apiKey=636cf20c29984ce08504f5c89c2cee02'
source_data = requests.get(site, allow_redirects=True).json()
print(json.dumps(source_data, indent=2)[:1890])
{
  "status": "ok",
  "source": "bloomberg",
  "sortBy": "top",
  "articles": [
    {
      "author": "Terrence Dopp, Chris Strohm",
      "title": "Trump Keeps Controversies Smoldering With a Warning for Comey",
      "description": "President Donald Trump fired off warnings and barely veiled threats in a burst of morning tweets that raised new questions about the dismissal of former FBI Director James Comey and kept attention on the investigation of Russian political meddling.",
      "url": "https://www.bloomberg.com/politics/articles/2017-05-12/comey-should-hope-there-are-no-tapes-trump-tweets",
      "urlToImage": "https://assets.bwbx.io/images/users/iqjWHBFdfxIU/iaIMBW8gh14g/v0/1200x675.jpg",
      "publishedAt": "2017-05-12T13:13:33.48Z"
    },
    {
      "author": "Alexis Leondis, Lynnley Browning",
      "title": "Trump Lawyers: Tax Returns Show Little Income From Russians",
      "description": "President Donald Trump\u2019s personal lawyers said his tax returns from the past 10 years show that -- with a few exceptions -- he received no income from Russian sources and owed no debts to Russian lenders.",
      "url": "https://www.bloomberg.com/politics/articles/2017-05-12/trump-lawyers-say-tax-returns-show-little-income-from-russians",
      "urlToImage": "https://assets.bwbx.io/images/users/iqjWHBFdfxIU/iyNfnnn_xYXc/v0/1200x675.jpg",
      "publishedAt": "2017-05-12T15:52:38.9Z"
    },
    {
      "author": "Sho Chandra",
      "title": "The U.S. Economy Is Back on Track",
      "description": "The U.S. economy is back on track for steady growth, though not much more.",
      "url": "https://www.bloomberg.com/news/articles/2017-05-12/it-s-back-to-steady-for-u-s-economy-as-retail-sales-cpi-rise",
      "urlToImage": "https://assets.bwbx.io/images/users/iqjWHBFdfxIU/ifoXm.KiiYCk/v0/1200x800.jpg",
      "publishedAt": "2017-05-12T15:50:31.123Z"
    }

Each output is a set of nested dictionaries with information regarding the

  • Article source (Bloomberg in this case)
  • Sortby (top, latest, or popular)
  • Author
  • Article title
  • Article description
  • Article url
  • Article image url
  • Date of publish
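Each of these fields lives in the nested dictionary shown above, so pulling them out is just a matter of indexing into the parsed JSON. Here's a minimal sketch using a hard-coded sample response rather than a live API call:

```python
# Sketch: extracting the fields listed above from a parsed newsapi response.
# `response` is a hard-coded sample here; in practice it would come from
# requests.get(site).json() as shown earlier.
response = {
    "status": "ok",
    "source": "bloomberg",
    "sortBy": "top",
    "articles": [
        {"author": "Sho Chandra",
         "title": "The U.S. Economy Is Back on Track",
         "description": "The U.S. economy is back on track for steady growth.",
         "url": "https://www.bloomberg.com/news/articles/...",
         "urlToImage": "https://assets.bwbx.io/images/...",
         "publishedAt": "2017-05-12T15:50:31.123Z"},
    ],
}

# One row per article: (publish date, author, title, description, url)
rows = [(a["publishedAt"], a["author"], a["title"], a["description"], a["url"])
        for a in response["articles"]]
print(rows[0][2])  # The U.S. Economy Is Back on Track
```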

Automatically Collecting the Data

One of the great things about newsapi is that this sort of data can be collected from over 70 news sources and blogs from various parts of the world, some of which are not even in English. To fetch the data like we've done above for many news sources, we need a way to automate the process. In general, the tasks we need to perform are:

  • Loop over a bunch of news sources (e.g., Bloomberg, New York Times, etc.).
  • For each news source, grab the top, latest, and popular articles.
  • Get information from each article regarding title, description, url, etc.
  • Save this data to a csv file for easy reading later on.

To this end, I wrote a Python module called scrapeNews.py to automatically perform the tasks listed above. I should note again that Ahmed Besbes' data science blog was an outstanding source of information for how to do this.

In [ ]:
import pandas as pd
import requests
import os.path
import csv

API_key = '636cf20c29984ce08504f5c89c2cee02'
outdir = '/Users/degravek/Downloads/'

def articleType():
    sort_by = ['top', 'latest', 'popular']
    source_id = ['associated-press', 'bbc-news', 'bbc-sport', 'bloomberg',
                'business-insider', 'cnbc', 'cnn', 'daily-mail', 'engadget',
                'entertainment-weekly', 'espn', 'financial-times', 'fortune',
                'four-four-two', 'fox-sports', 'google-news', 'hacker-news', 'mtv-news',
                'national-geographic', 'new-scientist', 'newsweek', 'nfl-news', 'reuters',
                'talksport', 'techcrunch', 'techradar', 'the-economist',
                'the-guardian-uk', 'the-huffington-post', 'the-new-york-times',
                'the-next-web', 'the-sport-bible', 'the-telegraph', 'the-verge',
                'the-wall-street-journal', 'the-washington-post', 'time', 'usa-today',
                'ars-technica', 'al-jazeera-english']
    return source_id, sort_by

def getCategories():
    site = 'https://newsapi.org/v1/sources'
    source_info = requests.get(site).json()

    source_categories = {}
    for element in source_info['sources']:
        source_categories[element['id']] = element['category']
    return source_categories

def writeData(output):
    file_name = outdir + 'news_articles.csv'

    df = pd.DataFrame(output, columns=['publishedAt', 'author', 'category', 'title',
                                        'description', 'url'])
    if os.path.isfile(file_name):
        articles = pd.read_csv(file_name)

        df = pd.concat([df, articles])
        df = df.drop_duplicates(subset='url')
        df.to_csv(file_name, mode='w', encoding='utf-8', index=False, header=True)
    else:
        df = df.drop_duplicates(subset='url')
        df.to_csv(file_name, mode='w', encoding='utf-8', index=False, header=True)

def scrapeNews():
    source_id, sort_by = articleType()
    source_categories = getCategories()

    output = []
    for sid in source_id:
        for sb in sort_by:
            # Get news article json
            site = 'https://newsapi.org/v1/articles?source=' + sid + '&sortBy=' + \
                    sb + '&apiKey=' + API_key

            source_data = requests.get(site).json()

            try:
                for element in source_data['articles']:
                    if not element['author']: element['author'] = 'no_author'
                    output.append([element['publishedAt'], element['author'],
                                    source_categories[sid], element['title'],
                                    element['description'], element['url']])
            except (KeyError, TypeError):
                # Skip sources that returned an error payload instead of articles
                pass

    writeData(output)

if __name__ == '__main__':
    scrapeNews()

The code above might look daunting at first, but it's actually not so bad. At the top, I define my API key and the directory where my file containing news articles will be saved. The function articleType() defines which news sources we're interested in grabbing articles from. From these sources, scrapeNews() will then grab the top, latest, and popular articles.

Every news source available is given a "category" by newsapi. For example, the Associated Press is given a "general" tag, as that source covers many different news topics, while BBC Sport is given a "sport" tag. One will note, though, that in our example earlier, category was not a key in the dictionary. The categories for each source can instead be found here:

https://newsapi.org/v1/sources

The function getCategories() loops over every possible news source and grabs the corresponding category from the list. The result is a dictionary where the first ten entries look something like this:

In [67]:
site = 'https://newsapi.org/v1/sources'
source_info = requests.get(site).json()

source_categories = {}
for element in source_info['sources']:
    source_categories[element['id']] = element['category']

dict(zip(list(source_categories.keys())[:10], list(source_categories.values())[:10]))
Out[67]:
{'abc-news-au': 'general',
 'al-jazeera-english': 'general',
 'ars-technica': 'technology',
 'associated-press': 'general',
 'bbc-news': 'general',
 'bbc-sport': 'sport',
 'bild': 'general',
 'bloomberg': 'business',
 'breitbart-news': 'politics',
 'business-insider': 'business'}

The function scrapeNews() then loops over each news source we've decided to use, finds its corresponding category using the dictionary above, and then grabs all the relevant news article information from that source using our API key.

Because we're collecting the top, latest, and popular articles for each source, there are bound to be duplicate articles included when we're done. Therefore, before writing the data to file, we load in any previously saved version of the file and drop duplicate entries based on the article url column. All of this is done in writeData().
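The deduplication step can be illustrated in isolation. A sketch with toy data (not the real scraped file), where newly scraped rows are stacked on top of the previously saved rows and duplicates on the url column are dropped:

```python
import pandas as pd

# Newly scraped articles and a previously saved file that overlap on url 'b'
new = pd.DataFrame({'url': ['a', 'b'], 'title': ['A', 'B']})
old = pd.DataFrame({'url': ['b', 'c'], 'title': ['B', 'C']})

# Stack the new rows on top, then drop duplicates on the url column;
# drop_duplicates keeps the first occurrence, i.e. the newly scraped row
merged = pd.concat([new, old]).drop_duplicates(subset='url')
print(sorted(merged['url']))  # ['a', 'b', 'c']
```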

The Cron Job

We can run scrapeNews.py easily by calling python scrapeNews.py from the terminal command line. One will notice, though, that the number of news articles scraped after a single run is only on the order of a few hundred or so $-$ not nearly enough for our purposes. We could manually run the code every 15 minutes or so, but that would be pretty tedious. To get around this, we can set up something called a "cron job" to execute the script for us automatically at fixed time intervals.

Setting up a cron job is pretty easy. In our case (working on a Mac), we need to follow four steps:

  1. cd to where scrapeNews.py is located and type chmod +x scrapeNews.py to make the file executable
  2. From the terminal command line, type crontab -e
  3. In this prompt, enter the information regarding when the code will run (described below)
  4. Type (if using nano) control + o to save the changes, then control + x to exit

For step (3), the input will look something like:

* * * * * ~/anaconda/bin/python /Users/degravek/Site/news/scrapeNews.py

The second term in the line above (~/anaconda/bin/python) is the path to my python distribution, while the third term (/Users/degravek/Site/news/scrapeNews.py) is the absolute path to where scrapeNews.py is located. The five asterisks are placeholders that determine how frequently you want the code to be run $-$ from left to right, they represent the minute (0-59), hour (0-23), day of the month (1-31), month (1-12), and day of the week (0-6, Sunday through Saturday).

For this project, I ran scrapeNews.py every 15 minutes for several days, so my input looked like

*/15 * * * * ~/anaconda/bin/python /Users/degravek/Site/news/scrapeNews.py

When enough data has been collected, simply run crontab -e again and delete the entry (or run crontab -r to remove all of your cron jobs) to stop the cron job.

Okay, now let's start looking at the data! Before we get started, let's import some useful Python libraries that will come in handy later.

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from sklearn.manifold import TSNE
from collections import Counter
from string import punctuation
import pandas as pd
import numpy as np
import nltk
import re

import matplotlib.pyplot as mp
%matplotlib inline

from plotly.offline import download_plotlyjs
from plotly.offline import init_notebook_mode
from plotly.offline import plot, iplot
import cufflinks as cf

cf.go_offline()

stoplist = stopwords.words('english')

We now load in the csv containing the news article data, and drop any duplicates that may have gotten through.

In [2]:
df = pd.read_csv('/Users/degravek/Downloads/news_articles.csv')

df = df.drop_duplicates(subset='url')
df = df.drop_duplicates(subset='title')
df = df.reset_index(drop=True)
df = df[df['description'].notnull()].reset_index(drop=True)

df.loc[df['category'] == 'science-and-nature', 'category'] = 'science-nature'

Let's see what the data look like.

In [3]:
df.head()
Out[3]:
publishedAt author category title description url
0 2017-05-22T14:23:47+00:00 CHAD DAY general APNewsBreak: Source says Flynn to invoke 5th A... WASHINGTON (AP) — Former National Security Adv... https://apnews.com/9ede3ab68f734789a10ab6800f2...
1 2017-05-22T13:41:59+00:00 JONATHAN LEMIRE and JULIE PACE general Trump: Israelis and Arabs share 'common cause'... JERUSALEM (AP) — President Donald Trump opened... https://apnews.com/6a1462f8dad542789e600afdfdf...
2 2017-05-22T14:39:07+00:00 MARYCLAIRE DALE general Race, gender, fame all issues as Cosby jury se... PHILADELPHIA (AP) — Thirteen years after a Tem... https://apnews.com/465bed401c6340a99078e43fd5d...
3 2017-05-22T13:04:29+00:00 JUSTIN PRITCHARD, GILLIAN FLACCUS and REESE DU... general Growing grassroots movements confronting schoo... FOREST GROVE, Ore. (AP) — A pair of Oregon sch... https://apnews.com/b3199fa831f24b5fbb24b9daf6b...
4 2017-05-22T12:30:22+00:00 DEE-ANN DURBIN and TOM KRISHER general Ford replaces CEO Mark Fields in push to trans... DETROIT (AP) — Ford is replacing CEO Mark Fiel... https://apnews.com/9b6defe6f1264921b440a4a4954...

As discussed earlier, we have collected the

  • Publish date
  • Author
  • Category
  • Article title
  • Article Description
  • Article url

We can check out the distribution of categories. I use plotly and cufflinks to do this, as they produce nice interactive figures with function calls similar to those of Pandas.

In [41]:
df['category'].value_counts().iplot(kind="bar", xTitle='Category', color='blue',
                            yTitle='Counts', dimensions=(790,550), margin=(100,100,60,0))

In all, the news articles fall into seven categories:

  • General
  • Technology
  • Business
  • Sport
  • Entertainment
  • Science & Nature
  • Music

General is sort of a catch-all category (often containing political and world news), and therefore has many more articles than the other categories; we've also simply included more of these general-type news sources in scrapeNews.py. Among the seven categories, we see that Science & Nature and Music make up a small fraction of the total number of articles. Let's take a look at example article descriptions for some of these categories.

In [5]:
for category in set(df['category']):
    print('Category:', category)
    print(df[df['category'] == category]['description'].tolist()[0])
    print('--------------------')
Category: sport
A review of the final day of the Premier League season, the latest transfer news and gossip, as well as the best of social media.
--------------------
Category: music
Engineer Geoff Turner takes us through the song’s 1996 demo session
--------------------
Category: entertainment
'Viper' MacDonald and his son reportedly became suspicious and swooped in when they saws the man loitering around the youngsters in Hallglen Park, Falkirk.
--------------------
Category: business
President Donald Trump landed in Israel on a groundbreaking direct flight Monday from Saudi Arabia and expanded a core theme of his Sunday speech in Riyadh: the U.S. will stand with Arab nations and Israel against the threats they all agree are posed by Iran.
--------------------
Category: technology
DNP  Welcome to Tomorrow, a new section on Engadget that delves into future of, well, everything.
--------------------
Category: general
WASHINGTON (AP) — Former National Security Adviser Michael Flynn will invoke his Fifth Amendment protection against self-incrimination as he notifies a Senate panel that he won't hand over documents in the probe into Russia's meddling in the 2016 election, according to a person with direct knowledge of the matter. The notification will come in a letter to the Senate Intelligence committee expected later Monday. The person providing details spoke on condition anonymity in order to discuss private interactions between Flynn and the committee.
--------------------
Category: science-nature
While President Trump wants to revive America’s coal industry, India is embracing renewables, LED lighting, electric cars, and more.
--------------------

Alright, let's get to the modeling! We'll start with $n$-gram analysis.

$n$-Gram Analysis

The method of $n$-gram analysis is one of the most common types of topic modeling techniques employed today. By $n$-gram analysis, I mean examining combinations of $n$ sequential words in a sentence. For example, the 1-grams (called unigrams) and 2-grams (called bigrams) of the sentence "this is an example of a sentence" would be:

  • Unigrams: this, is, an, example, of, a, sentence
  • Bigrams: this is, is an, an example, example of, of a, a sentence

In the first case, the unigrams are just a list of all single words in the sentence. In the second case, the bigrams are all sequential two-word combinations.
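As a minimal illustration, the unigrams and bigrams above can be produced with a couple of lines of plain Python (a quick sketch, separate from the more complete functions we define later):

```python
sentence = "this is an example of a sentence"
tokens = sentence.split()

# Unigrams are just the individual tokens
unigrams = tokens
# Bigrams pair each token with its successor
bigrams = [' '.join(pair) for pair in zip(tokens, tokens[1:])]

print(bigrams)  # ['this is', 'is an', 'an example', 'example of', 'of a', 'a sentence']
```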

By looking at which $n$-grams occur most frequently in a document, we can get a sense for what is being talked about. In Python, the $n$-grams can be computed in two ways $-$ manually, using a function that we define ourselves, or automatically, using scikit-learn's CountVectorizer class. Let's try both!

Before the $n$-grams can be computed, we first have to process the text a little bit to remove punctuation and stopwords. Stopwords are words that occur so frequently in normal speech that they're of little interest to us; these include words like "the", "and", etc. We define two functions below to perform these tasks $-$ they're called processText() and rmStopwords(). A third function, nGrams(), takes in some text, tokenizes it (i.e., splits it into its individual words), and finds the $n$ sequential word combinations. In this manual implementation, we can also perform an additional task that is helpful in identifying interesting words: if we're finding unigrams, the function tags the part of speech of each word and keeps only nouns, which are more likely to be the subject of a piece of text.

In [6]:
punctuation = '!"#?$%“”&\'()’*+,—./:;<=>@[\\]^_`{|}~…' #(removed - symbol)
def processText(text):
    result = text.lower()
    result = ''.join(word for word in result
                    if word not in punctuation)
    result = re.sub(r' +', " ", result).strip()
    return result

def rmStopwords(text):
    result = text.lower().split()
    result = ' '.join(word for word in result
                      if word not in stoplist)
    result = re.sub(r' +', " ", result).strip()
    return result

def nGrams(text, n):
    result = []
    text = text.split()
    if n==1:
        partspeech = nltk.pos_tag(text)
        result = [word for word, pos in
                  partspeech if pos[0] == 'N']
    else:
        for i in range(len(text)-(n-1)):
            result.append(' '.join(text[i:i+n]))
    if not result:
        result = np.nan
    return result

With these functions defined, we can now process the text and compute $n$-grams. In processing the text, we keep only those news articles where the description has more than 140 characters, as shorter descriptions sometimes don't contain a lot of useful information and generally add noise when performing topic modeling.

In [7]:
df['pro_descr'] = df['description'].apply(rmStopwords).apply(processText)
df['num_chars'] = df['pro_descr'].apply(len)

df = df[df['num_chars'] > 140].reset_index(drop=True)

df['ngram_1']   = df['pro_descr'].apply(nGrams, args=(1,))
df['ngram_2']   = df['pro_descr'].apply(nGrams, args=(2,))
df['ngram_3']   = df['pro_descr'].apply(nGrams, args=(3,))

Let's look at the unigrams, bigrams, and trigrams.

In [8]:
df[['description','ngram_1','ngram_2','ngram_3']].head()
Out[8]:
description ngram_1 ngram_2 ngram_3
0 WASHINGTON (AP) — Former National Security Adv... [washington, security, adviser, michael, amend... [washington ap, ap former, former national, na... [washington ap former, ap former national, for...
1 JERUSALEM (AP) — President Donald Trump opened... [jerusalem, ap, president, trump, visit, israe... [jerusalem ap, ap president, president donald,... [jerusalem ap president, ap president donald, ...
2 PHILADELPHIA (AP) — Thirteen years after a Tem... [philadelphia, ap, years, university, basketba... [philadelphia ap, ap thirteen, thirteen years,... [philadelphia ap thirteen, ap thirteen years, ...
3 FOREST GROVE, Ore. (AP) — A pair of Oregon sch... [grove, ore, pair, school, districts, intent, ... [forest grove, grove ore, ore ap, ap pair, pai... [forest grove ore, grove ore ap, ore ap pair, ...
4 DETROIT (AP) — Ford is replacing CEO Mark Fiel... [ap, ford, mark, fields, business, provider, m... [detroit ap, ap ford, ford replacing, replacin... [detroit ap ford, ap ford replacing, ford repl...

We can now see which $n$-grams occur most frequently in each category.

In [9]:
def keywords_ngram(category, n, num):
    text = df[df['category'] == category]['ngram_' + str(n)]
    result = []
    for word_list in text:
        result += word_list
    return Counter(result).most_common(num)

for category in set(df['category']):
    print("Category:", category)
    print("1-grams:", keywords_ngram(category, 1, 10))
    print('--------------------')
    print("2-grams:", keywords_ngram(category, 2, 10))
    print('--------------------')
    print("3-grams:", keywords_ngram(category, 3, 10))
    print('--------------------')
Category: sport
1-grams: [('season', 117), ('league', 100), ('time', 48), ('club', 47), ('game', 47), ('team', 43), ('football', 38), ('premier', 38), ('summer', 34), ('goals', 33)]
--------------------
2-grams: [('premier league', 74), ('manchester united', 25), ('new york', 18), ('champions league', 18), ('los angeles', 18), ('league final', 17), ('league season', 15), ('europa league', 15), ('golden boot', 14), ('real madrid', 12)]
--------------------
3-grams: [('premier league season', 14), ('europa league final', 14), ('2017 nba draft', 10), ('los angeles lakers', 9), ('western conference finals', 6), ('tampa bay rays', 6), ('stanley cup playoffs', 6), ('free agency draft', 5), ('seasons champions league', 5), ('toronto blue jays', 5)]
--------------------
Category: entertainment
1-grams: [('president', 16), ('star', 11), ('night', 10), ('police', 10), ('james', 10), ('trump', 9), ('home', 8), ('director', 8), ('state', 7), ('donald', 7)]
--------------------
2-grams: [('donald trump', 9), ('fbi director', 6), ('premier league', 5), ('president donald', 5), ('director james', 5), ('saudi arabia', 4), ('last year', 4), ('james matthews', 3), ('golden state', 3), ('san antonio', 3)]
--------------------
3-grams: [('president donald trump', 5), ('fbi director james', 5), ('fired fbi director', 3), ('director james comey', 3), ('pippa middletons wedding', 2), ('james matthews wedding', 2), ('former fbi director', 2), ('director james comeys', 2), ('wikileaks founder julian', 2), ('founder julian assange', 2)]
--------------------
Category: business
1-grams: [('president', 61), ('donald', 33), ('director', 19), ('trump', 15), ('house', 13), ('investigation', 13), ('administration', 11), ('years', 11), ('james', 10), ('crisis', 10)]
--------------------
2-grams: [('president donald', 33), ('donald trump', 26), ('donald trumps', 17), ('fbi director', 16), ('saudi arabia', 9), ('james comey', 9), ('president michel', 9), ('white house', 9), ('michel temer', 8), ('former fbi', 7)]
--------------------
3-grams: [('president donald trump', 21), ('president donald trumps', 12), ('president michel temer', 8), ('former fbi director', 7), ('fbi director james', 6), ('director james comey', 6), ('us president donald', 4), ('treasury secretary steven', 4), ('director robert mueller', 4), ('former national security', 3)]
--------------------
Category: technology
1-grams: [('google', 41), ('today', 35), ('work', 31), ('time', 30), ('company', 29), ('technology', 27), ('day', 24), ('years', 23), ('data', 22), ('people', 21)]
--------------------
2-grams: [('machine learning', 10), ('neural network', 10), ('google io', 9), ('self-driving car', 9), ('windows 10', 8), ('licensing program', 8), ('time think', 7), ('computer science', 7), ('san francisco', 6), ('google assistant', 6)]
--------------------
3-grams: [('self-driving car project', 4), ('today google io', 3), ('cluster docker swarm', 3), ('autonomous driving technology', 3), ('waymo alleges levandowski', 3), ('fraunhofers mp3 software', 3), ('work schedule doesnt', 2), ('creativity critical thinking', 2), ('moment-to-moment tactical issues', 2), ('google io developer', 2)]
--------------------
Category: general
1-grams: [('president', 422), ('trump', 201), ('donald', 143), ('investigation', 101), ('director', 98), ('election', 92), ('house', 88), ('ap', 86), ('campaign', 81), ('washington', 79)]
--------------------
2-grams: [('president donald', 201), ('donald trump', 171), ('donald trumps', 78), ('fbi director', 78), ('james comey', 60), ('said thursday', 53), ('new york', 52), ('white house', 49), ('director james', 48), ('us president', 47)]
--------------------
3-grams: [('president donald trump', 140), ('president donald trumps', 59), ('director james comey', 47), ('us president donald', 43), ('fbi director james', 42), ('former fbi director', 40), ('national security adviser', 30), ('security adviser michael', 20), ('president michel temer', 20), ('president hassan rouhani', 18)]
--------------------
Category: science-nature
1-grams: [('bike', 2), ('culture', 1), ('town', 1), ('needs', 1), ('quality', 1), ('mountain', 1), ('hamlets', 1), ('country', 1), ('order', 1), ('offer', 1)]
--------------------
2-grams: [('foster authentic', 1), ('authentic bike', 1), ('bike culture', 1), ('culture town', 1), ('town needs', 1), ('needs high', 1), ('high quality', 1), ('quality trails', 1), ('trails 20', 1), ('20 mountain', 1)]
--------------------
3-grams: [('foster authentic bike', 1), ('authentic bike culture', 1), ('bike culture town', 1), ('culture town needs', 1), ('town needs high', 1), ('needs high quality', 1), ('high quality trails', 1), ('quality trails 20', 1), ('trails 20 mountain', 1), ('20 mountain bike', 1)]
--------------------

Now let's perform the same exercise, but using Scikit-Learn's CountVectorizer instead.

In [10]:
def keywords_cvec(category, n, num):
    text = df[df['category'] == category]['pro_descr']

    vector = CountVectorizer(min_df=1, strip_accents='unicode', analyzer='word',
                     token_pattern=r'\w{1,}', ngram_range=(n,n), stop_words=stoplist)
    cvec = vector.fit_transform(text)

    feature_names = vector.get_feature_names()
    feature_count = cvec.toarray().sum(axis=0)
    sort_feature = sorted(zip(feature_names, feature_count), key=lambda x: x[1], reverse=True)[:num]
    return sort_feature
In [11]:
for category in set(df['category']):
    print("Category:", category)
    print("1-grams:", keywords_cvec(category, 1, 10))
    print('--------------------')
    print("2-grams:", keywords_cvec(category, 2, 10))
    print('--------------------')
    print("3-grams:", keywords_cvec(category, 3, 10))
    print('--------------------')
Category: sport
1-grams: [('league', 147), ('season', 119), ('final', 96), ('year', 86), ('premier', 79), ('first', 62), ('old', 58), ('time', 57), ('club', 56), ('game', 52)]
--------------------
2-grams: [('premier league', 74), ('year old', 45), ('manchester united', 25), ('champions league', 18), ('los angeles', 18), ('new york', 18), ('league final', 17), ('europa league', 15), ('league season', 15), ('golden boot', 14)]
--------------------
3-grams: [('europa league final', 14), ('premier league season', 14), ('2017 nba draft', 10), ('los angeles lakers', 9), ('23 year old', 7), ('36 year old', 6), ('stanley cup playoffs', 6), ('tampa bay rays', 6), ('western conference finals', 6), ('2017 stanley cup', 5)]
--------------------
Category: entertainment
1-grams: [('year', 27), ('old', 23), ('new', 19), ('president', 16), ('trump', 16), ('first', 15), ('one', 15), ('former', 13), ('pictured', 13), ('said', 13)]
--------------------
2-grams: [('year old', 20), ('donald trump', 9), ('fbi director', 6), ('director james', 5), ('premier league', 5), ('president donald', 5), ('last year', 4), ('saudi arabia', 4), ('fired fbi', 3), ('golden state', 3)]
--------------------
3-grams: [('fbi director james', 5), ('president donald trump', 5), ('director james comey', 3), ('fired fbi director', 3), ('14 year old', 2), ('16 year old', 2), ('34 year old', 2), ('41 year old', 2), ('director james comeys', 2), ('former fbi director', 2)]
--------------------
Category: business
1-grams: [('president', 61), ('us', 51), ('donald', 43), ('trump', 40), ('said', 26), ('new', 24), ('director', 19), ('fbi', 19), ('first', 19), ('former', 19)]
--------------------
2-grams: [('president donald', 33), ('donald trump', 26), ('donald trumps', 17), ('fbi director', 16), ('james comey', 9), ('president michel', 9), ('saudi arabia', 9), ('white house', 9), ('michel temer', 8), ('former fbi', 7)]
--------------------
3-grams: [('president donald trump', 21), ('president donald trumps', 12), ('president michel temer', 8), ('former fbi director', 7), ('director james comey', 6), ('fbi director james', 6), ('director robert mueller', 4), ('treasury secretary steven', 4), ('us president donald', 4), ('attorney general rod', 3)]
--------------------
Category: technology
1-grams: [('google', 65), ('new', 63), ('work', 41), ('first', 37), ('us', 36), ('levandowski', 35), ('today', 35), ('one', 34), ('like', 32), ('day', 31)]
--------------------
2-grams: [('self driving', 15), ('machine learning', 10), ('neural network', 10), ('driving car', 9), ('google io', 9), ('licensing program', 8), ('windows 10', 8), ('computer science', 7), ('eight hour', 7), ('open source', 7)]
--------------------
3-grams: [('self driving car', 9), ('driving car project', 4), ('autonomous driving technology', 3), ('cluster docker swarm', 3), ('fraunhofers mp3 software', 3), ('today google io', 3), ('waymo alleges levandowski', 3), ('12 inch macbook', 2), ('assistant get help', 2), ('automation ssh terminal', 2)]
--------------------
Category: general
1-grams: [('president', 424), ('trump', 340), ('said', 331), ('us', 287), ('donald', 251), ('thursday', 191), ('ap', 179), ('new', 176), ('wednesday', 159), ('former', 128)]
--------------------
2-grams: [('president donald', 201), ('donald trump', 171), ('fbi director', 85), ('donald trumps', 78), ('james comey', 60), ('new york', 53), ('said thursday', 53), ('white house', 50), ('director james', 48), ('us president', 47)]
--------------------
3-grams: [('president donald trump', 140), ('president donald trumps', 59), ('fbi director james', 48), ('director james comey', 47), ('us president donald', 43), ('former fbi director', 40), ('national security adviser', 32), ('president michel temer', 20), ('security adviser michael', 20), ('former national security', 18)]
--------------------
Category: science-nature
1-grams: [('bike', 4), ('20', 1), ('around', 1), ('authentic', 1), ('bucket', 1), ('charlie', 1), ('country', 1), ('culture', 1), ('development', 1), ('eyes', 1)]
--------------------
2-grams: [('20 mountain', 1), ('around country', 1), ('authentic bike', 1), ('bike culture', 1), ('bike friendly', 1), ('bike hamlets', 1), ('bucket list', 1), ('charlie hamilton', 1), ('country particular', 1), ('culture town', 1)]
--------------------
3-grams: [('20 mountain bike', 1), ('around country particular', 1), ('authentic bike culture', 1), ('bike culture town', 1), ('bike friendly vibe', 1), ('bike hamlets around', 1), ('bucket list rides', 1), ('charlie hamilton james', 1), ('country particular order', 1), ('culture town needs', 1)]
--------------------

We see that in the business and general categories, for example, many news articles focus heavily on Donald Trump, former FBI director James Comey, and former national security adviser Michael Flynn. Let's plot the ten most frequent trigrams in the general category.

In [12]:
ngram = pd.DataFrame(keywords_ngram('general', 3, 10), columns=['ngram_3','counts'])

plot = ngram.iplot(kind="bar", x='ngram_3', y='counts', xTitle='Trigrams', color='blue',
                            yTitle='Counts', dimensions=(790,550), margin=(80,120,110,20))

Another common method used in topic modeling is called noun-phrase chunking. The idea is to find nouns and the surrounding descriptive words in a text, as these phrases often carry the main point (i.e., the topic) of a sentence. I was first introduced to the idea of chunking by this super informative article about topic modeling.

To perform chunking, the first thing we have to do is define a "chunking pattern" (the string of symbols in the function's grammar argument). The chunking pattern tells the parser which sorts of phrases to look for in the text. In our case, it's zero or more adjectives (JJ tags) followed by one or more nouns (NN tags).

In [13]:
def extract_candidate_chunks(text, grammar='CHUNK: {<JJ.*>*<NN.*>+}'):
    import nltk
    parser = nltk.RegexpParser(grammar)
    # part-of-speech tag the text, then pull out subtrees matching the grammar
    tagged = nltk.pos_tag(nltk.word_tokenize(text))

    candidates = []
    if tagged:
        tree = parser.parse(tagged)
        for subtree in tree.subtrees():
            if subtree.label() == 'CHUNK':
                candidates.append(' '.join(word for word, tag in subtree.leaves()))
    return [phrase for phrase in candidates if phrase not in stoplist]
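To make the grammar concrete, here is a minimal pure-Python sketch (independent of NLTK, with hand-labeled tags purely for illustration) of what the pattern `{<JJ.*>*<NN.*>+}` matches: runs of zero or more adjectives (JJ-family tags) followed by one or more nouns (NN-family tags).

```python
def chunk_jj_nn(tagged):
    """Return phrases matching the pattern JJ* NN+ over (word, tag) pairs."""
    chunks, adjs, nouns = [], [], []

    def flush():
        # a chunk is only valid if it contains at least one noun;
        # dangling adjectives with no following noun are dropped
        if nouns:
            chunks.append(' '.join(adjs + nouns))
        adjs.clear()
        nouns.clear()

    for word, tag in tagged:
        if tag.startswith('JJ'):
            if nouns:          # an adjective after a noun starts a new chunk
                flush()
            adjs.append(word)
        elif tag.startswith('NN'):
            nouns.append(word)
        else:
            flush()            # any other tag ends the current chunk
    flush()
    return chunks

# hand-tagged example sentence (tags made up for illustration)
sentence = [('the', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'),
            ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'),
            ('lazy', 'JJ'), ('dog', 'NN')]
print(chunk_jj_nn(sentence))   # → ['quick brown fox', 'lazy dog']
```

In the real function above, `nltk.pos_tag` supplies the tags and `nltk.RegexpParser` does this matching for us.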

Now we'll define another function to count common chunks.

In [14]:
def keyphrases_chunk(category, num):
    tokens = df[df['category'] == category]['chunk']
    
    alltokens = []
    for token_list in tokens:
        alltokens += token_list
    counter = Counter(alltokens)
    return counter.most_common(num)

Let's try it out!

In [15]:
df['chunk'] = df['pro_descr'].apply(extract_candidate_chunks)

for category in set(df['category']):
    print("Category:", category)
    print("chunk:", keyphrases_chunk(category, 15))
    print('--------------------')
Category: sport
chunk: [('goals', 16), ('season', 16), ('years', 15), ('way', 12), ('time', 11), ('players', 11), ('clubs', 10), ('game', 10), ('manchester', 9), ('games', 9), ('first time', 8), ('seasons', 8), ('premier league', 7), ('premier league season', 7), ('minutes', 7)]
--------------------
Category: entertainment
chunk: [('president', 6), ('months', 4), ('season', 3), ('golden state', 3), ('passengers', 3), ('years', 3), ('president donald', 3), ('friday', 3), ('researchers', 3), ('shes', 3), ('night', 2), ('children', 2), ('guests', 2), ('people', 2), ('los angeles', 2)]
--------------------
Category: business
chunk: [('president donald', 18), ('president', 12), ('years', 5), ('election', 4), ('officials', 4), ('people', 4), ('former fbi director', 4), ('supreme court', 3), ('fbi director james', 3), ('trump administration', 3), ('worlds', 3), ('congress', 3), ('new york ap', 3), ('months', 3), ('james', 3)]
--------------------
Category: technology
chunk: [('google', 17), ('years', 12), ('people', 11), ('today', 9), ('time', 9), ('neural network', 8), ('way', 7), ('day', 7), ('systems', 7), ('things', 7), ('technology', 7), ('program', 7), ('jobs', 6), ('year', 6), ('percent', 6)]
--------------------
Category: general
chunk: [('president', 70), ('president donald', 54), ('trump', 45), ('years', 38), ('people', 36), ('wednesday', 31), ('washington', 27), ('united states', 27), ('election', 26), ('thursday', 26), ('times', 24), ('months', 18), ('percent', 18), ('special counsel', 18), ('country', 17)]
--------------------
Category: science-nature
chunk: [('authentic bike culture town needs', 1), ('high quality', 1), ('mountain bike hamlets', 1), ('country', 1), ('particular order offer', 1), ('bucket-list rides', 1), ('new trail development variety', 1), ('outdoor recreation fun', 1), ('vibe bike', 1), ('much notice orangutans', 1), ('hand', 1), ('national geographic photographer charlie hamilton', 1), ('favorite places', 1)]
--------------------

We see that chunking also does a pretty good job of grabbing the main topics of the text. The results are similar to the mixture of unigrams, bigrams, and trigrams found using $n$-gram analysis, but they can also contain longer, more informative strings like "national geographic photographer charlie hamilton".

Term Frequency$-$Inverse Document Frequency

One very popular method used in clustering documents and in isolating "important" words within those documents is called term frequency$-$inverse document frequency (abbreviated tf-idf). The idea behind tf-idf is that a word is deemed "important" if it occurs frequently within a single document (i.e., a news article description), but is less important if it occurs many times across several documents. In this way, words common to many documents like "the", "and", etc. are given a lower tf-idf score and are therefore not very important.

How can we represent tf-idf mathematically, though? To compute a tf-idf score of a term, the name itself suggests that we might count the number of times the term appears in a document (term frequency), and weight it by the number of times it occurs across all documents (inverse document frequency). Mathematically, this is written as

$$tfidf(t,d,D) = tf(t,d) \times idf(t,D),$$

where $tf(t,d)$ is the frequency of term $t$ in document $d$, and $idf(t,D)$ is the inverse document frequency of term $t$ across the collection of documents $D$, given as

$$idf(t,D) = \log\frac{\big|D\big|}{1+\big|\{d\in D:t\in d\}\big|}.$$

Here $\big|D\big|$ is the total number of documents, and the denominator counts how many documents contain $t$ (the $+1$ guards against division by zero for terms that appear in no document).
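As a quick sanity check of these formulas, here's a hand-rolled sketch on a made-up toy corpus (note that the TfidfVectorizer we use below applies smoothing and normalization on top of this, so its numbers will differ):

```python
import math

def tf(term, doc):
    # raw term frequency: how often the term appears in the document
    return doc.count(term)

def idf(term, docs):
    # inverse document frequency with the +1 in the denominator, as above
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / (1 + df))

# three tiny toy "documents" (lists of tokens), made up for illustration
docs = [
    "president donald trump said".split(),
    "fbi director james comey said".split(),
    "premier league season said".split(),
]

# "said" occurs in every document, so its idf (and hence tf-idf) is low,
# while "comey" occurs in only one document, so it scores higher
for term in ("said", "comey"):
    score = tf(term, docs[1]) * idf(term, docs)
    print(term, round(score, 3))   # said → -0.288, comey → 0.405
```

With this formulation a term appearing in every document can even get a negative idf, which is exactly why ubiquitous words like "the" end up unimportant.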

Luckily for us, Python's Scikit$-$Learn has a built-in class for doing this called TfidfVectorizer. We now feed TfidfVectorizer the processed article descriptions (punctuation and stopwords removed).

In [16]:
tfidf = TfidfVectorizer(min_df=10, max_features=10000, strip_accents='unicode',
                           analyzer='word', token_pattern=r'\w{1,}', ngram_range=(1,2),
                           use_idf=1, smooth_idf=1, sublinear_tf=1)

tfidf_descr = tfidf.fit_transform(df['pro_descr'])
In [17]:
tfidf_dict = dict(zip(tfidf.get_feature_names(), tfidf.idf_))
tfidf_df = pd.DataFrame(tfidf.idf_, index=tfidf.get_feature_names(), columns=['tfidf_val'])
tfidf_df.sort_values('tfidf_val', ascending=False, inplace=True)

Let's look at some of the unigrams and bigrams deemed most important through tf-idf.

In [18]:
tfidf_df.head(10)
Out[18]:
tfidf_val
zone 6.213947
owners 6.213947
follow 6.213947
pence 6.213947
paul ryan 6.213947
free trade 6.213947
passed 6.213947
giants 6.213947
panel 6.213947
associated press 6.213947

Performing tf-idf has converted each article description into a 1,414-dimensional vector. For plotting purposes, let's reduce the dimensionality using truncated singular value decomposition (SVD) and t-distributed stochastic neighbor embedding (t-SNE). t-SNE is a really cool technique for preserving "distances" between the tf-idf vectors while at the same time reducing their dimensionality. This is often useful in clustering.

In [19]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=50, random_state=0)
svd_tfidf = svd.fit_transform(tfidf_descr)
In [20]:
from sklearn.manifold import TSNE

tsne_model = TSNE(n_components=2, verbose=1, random_state=0, learning_rate=100)
tsne_tfidf = tsne_model.fit_transform(svd_tfidf)
[t-SNE] Computing pairwise distances...
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Computed conditional probabilities for sample 1000 / 2021
[t-SNE] Computed conditional probabilities for sample 2000 / 2021
[t-SNE] Computed conditional probabilities for sample 2021 / 2021
[t-SNE] Mean sigma: 0.124269
[t-SNE] KL divergence after 100 iterations with early exaggeration: 2.045509
[t-SNE] Error after 350 iterations: 2.045509
In [21]:
tfidf_df = pd.DataFrame(tsne_tfidf, columns=['x', 'y'])
tfidf_df['description'] = df['description']
tfidf_df['category'] = df['category']
In [22]:
tfidf_df.iplot(kind="scatter", x='x', y='y', mode='markers', size=5, text='description',
              dimensions=(790,650), margin=(50,50,40,40), color='blue')

We see that tf-idf does a fairly good job of clustering similar topics. For example, there are clusters discussing Donald Trump and James Comey, a cluster discussing sports, one discussing technology, etc.

k$-$Means Clustering

Another technique sometimes used in topic modeling to group similar documents is called k$-$means clustering. The idea is that given a set of feature vectors describing various documents like those plotted above, we can use an algorithm to identify any clusters present, and also identify common key words shared between the documents in those clusters. For this work, we'll assume there are 15 clusters present.
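The choice of 15 clusters is a judgment call. One common way to sanity-check a choice of $k$ is the "elbow" heuristic: fit k-means for a range of $k$ values and look for where the within-cluster sum of squares (the model's inertia) stops dropping sharply. Here's a minimal sketch on synthetic blob data (a stand-in for our tf-idf vectors):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# synthetic stand-in for the tf-idf vectors: 300 points in 4 true clusters
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# inertia = within-cluster sum of squared distances; it always decreases
# as k grows, but the decrease flattens once k passes the true cluster count
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 9)}

for k, val in inertias.items():
    print(k, round(val, 1))
```

For real text data the elbow is usually much less crisp than on blobs like these, which is why 15 here is ultimately an informed guess.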

In [23]:
from sklearn.cluster import MiniBatchKMeans

n_clusters = 15

kmeans_model = MiniBatchKMeans(n_clusters=n_clusters, init='k-means++', n_init=100,
                               batch_size=100, verbose=False, max_iter=1000, random_state=3)

kmeans = kmeans_model.fit(tfidf_descr)
kmeans_clusters = kmeans.predict(tfidf_descr)
kmeans_distances = kmeans.transform(tfidf_descr)

Let's look at the common unigrams and bigrams shared between documents in each cluster.

In [24]:
sorted_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]
terms = tfidf.get_feature_names()

for i in range(n_clusters):
    print("Cluster %d:" % i)
    for ind in sorted_centroids[i, :10]:
        print(terms[ind], end=' / ')
    print('\n', '--------------------')
Cluster 0:
said / us / new / first / friday / thursday / may / wednesday / two / would / 
 --------------------
Cluster 1:
fbi / fbi director / director / comey / james / james comey / director james / president / trump / donald / 
 --------------------
Cluster 2:
year / last / season / back / two / club / last year / end / three / final / 
 --------------------
Cluster 3:
states / united states / united / said / turkish / thursday / us / syria / fight / said thursday / 
 --------------------
Cluster 4:
president michel / michel / temer / michel temer / president / corruption / brazils / supreme court / supreme / brazilian / 
 --------------------
Cluster 5:
rouhani / hassan / hassan rouhani / irans / president hassan / election / president / re / iranian / re election / 
 --------------------
Cluster 6:
cornell / police / death / chris / detroit / officer / singer / hotel / honor / room / 
 --------------------
Cluster 7:
old / year old / year / anthony / girl / sales / guilty / man / 15 / ap / 
 --------------------
Cluster 8:
special / special counsel / counsel / mueller / russia / investigation / robert mueller / robert / 2016 / campaign / 
 --------------------
Cluster 9:
play / play off / conference / finals / conference finals / final / off / game / western / championship / 
 --------------------
Cluster 10:
donald / trump / president / president donald / donald trump / us / saudi / trumps / donald trumps / saudi arabia / 
 --------------------
Cluster 11:
look / army / prison / classified / military / chelsea / woman / years / video / one / 
 --------------------
Cluster 12:
angeles / los / los angeles / ailes / news / fox news / fox / roger ailes / roger / flight / 
 --------------------
Cluster 13:
league / premier / premier league / season / final / goals / champions / chelsea / manchester / united / 
 --------------------
Cluster 14:
new york / york / new / square / times square / times / killing / pedestrians / car / ap / 
 --------------------

We can see that one cluster comprises articles discussing the relationship between the United States, Turkey, and Syria, while another focuses on James Comey and Donald Trump, and yet another is associated with Premier League soccer. We can make another plot in which the identified clusters are given different colors.

In [25]:
tsne_model = TSNE(n_components=2, verbose=1, random_state=0, learning_rate=1500)
tsne_kmeans = tsne_model.fit_transform(kmeans_distances)

df_kmeans = pd.DataFrame(tsne_kmeans, columns=['x', 'y'])
df_kmeans['cluster'] = kmeans_clusters
df_kmeans['description'] = df['description']
df_kmeans['category'] = df['category']
[t-SNE] Computing pairwise distances...
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Computed conditional probabilities for sample 1000 / 2021
[t-SNE] Computed conditional probabilities for sample 2000 / 2021
[t-SNE] Computed conditional probabilities for sample 2021 / 2021
[t-SNE] Mean sigma: 0.012658
[t-SNE] KL divergence after 100 iterations with early exaggeration: 1.066050
[t-SNE] Error after 200 iterations: 1.066050
In [26]:
colormap = ["#6d8dca", "#69de53", "#723bca", "#c3e14c", "#c84dc9",
            "#68af4e", "#6e6cd5", "#e3be38", "#4e2d7c", "#5fdfa8",
            "#d34690", "#3f6d31", "#d44427", "#7fcdd8", "#cb4053"]

color_dict = dict(zip(range(0,n_clusters), colormap))

df_kmeans['color'] = df_kmeans['cluster'].map(color_dict)
In [38]:
import plotly.graph_objs as go

trace = go.Scatter(
    x = df_kmeans['x'],
    y = df_kmeans['y'],
    mode = 'markers',
    text = df_kmeans['description'],
    marker = dict(size=5, color=df_kmeans['color'].tolist())
    )

df_kmeans.iplot([trace], dimensions=(790,650), margin=(50,50,40,40))

Like tf-idf clustering, k$-$means clustering does a good job of separating topics. The figure clearly shows roughly 15 distinct clusters.

Latent Dirichlet Allocation (LDA)

Lastly, let's try out another popular technique used in topic modeling called latent Dirichlet allocation (LDA). LDA is a statistical method that assumes each document is a mixture of $n$ topics, and that every word in a document can be attributed to one of those topics. A detailed explanation of LDA can be found here. To perform LDA, we can make use of the lda package found here. Performing LDA can be a bit tricky, though, as the results can vary dramatically depending on our choice of the number of topics to identify, how noisy the documents are, etc. Let's see how we do with our news articles!

In [28]:
from lda import LDA
import logging

logging.getLogger("lda").setLevel(logging.WARNING)

n_topics = 15

vector = CountVectorizer(min_df=5, max_features=10000, strip_accents='unicode', analyzer='word',
                     token_pattern=r'\w{1,}', ngram_range=(1,2))

cvec = vector.fit_transform(df['pro_descr'])

model = LDA(n_topics=n_topics, n_iter=2500, random_state=0)
topics = model.fit_transform(cvec)
WARNING:lda:all zero row in document-term matrix found
In [29]:
n_top_words = 8
topic_word = model.topic_word_

for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vector.get_feature_names())[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))
Topic 0: trump president donald president donald donald trump saudi us first
Topic 1: trump president donald president donald fbi house comey donald trump
Topic 2: may state said conference top san 20 friday
Topic 3: league season final premier premier league year club old
Topic 4: us said thursday state south united officials north
Topic 5: two first time night years 1 ap one
Topic 6: new york new york times team los los angeles angeles
Topic 7: year old year old last world made show one
Topic 8: its new day like work get many time
Topic 9: campaign special us russia election 2016 donald investigation
Topic 10: minister may said new next wednesday monday prime
Topic 11: police ap said year former news wednesday old
Topic 12: google new company technology announced today home one
Topic 13: us said president billion thursday percent million michel
Topic 14: president election leader presidential rouhani hassan irans hassan rouhani

Again, we see that there are articles discussing Donald Trump and James Comey, some discussing Premier League soccer, several about Google and new technology, etc. Let's plot the clusters.

In [30]:
tsne_model = TSNE(n_components=2, verbose=1, random_state=0, learning_rate=3000)
tsne_lda = tsne_model.fit_transform(topics)

df_lda = pd.DataFrame(tsne_lda, columns=['x', 'y'])
df_lda['cluster'] = kmeans_clusters
df_lda['description'] = df['description']
df_lda['category'] = df['category']
[t-SNE] Computing pairwise distances...
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Computed conditional probabilities for sample 1000 / 2021
[t-SNE] Computed conditional probabilities for sample 2000 / 2021
[t-SNE] Computed conditional probabilities for sample 2021 / 2021
[t-SNE] Mean sigma: 0.137565
[t-SNE] KL divergence after 100 iterations with early exaggeration: 0.981133
[t-SNE] Error after 175 iterations: 0.981133
In [31]:
colormap = ["#6d8dca", "#69de53", "#723bca", "#c3e14c", "#c84dc9",
            "#68af4e", "#6e6cd5", "#e3be38", "#4e2d7c", "#5fdfa8",
            "#d34690", "#3f6d31", "#d44427", "#7fcdd8", "#cb4053"]

color_dict = dict(zip(range(0,n_topics), colormap))

df_lda['color'] = df_lda['cluster'].map(color_dict)
In [34]:
import plotly.graph_objs as go

trace = go.Scatter(
    x = df_lda['x'],
    y = df_lda['y'],
    mode = 'markers',
    text = df_lda['description'],
    marker = dict(size=5, color=df_lda['color'].tolist())
    )

df_lda.iplot([trace], dimensions=(790,650), margin=(50,50,40,40))

Concluding Remarks

In this project we detailed a number of popular techniques used in topic modeling and document clustering today. These include methods like $n$-gram analysis, noun-phrase chunking, tf-idf, k$-$means clustering, and latent Dirichlet allocation. Interestingly, these all happen to fall under the category of "unsupervised" learning techniques, in that the algorithms process text blindly without having any a priori knowledge of what constitutes a legitimate topic. However, it is also possible to perform topic modeling using "supervised" learning methods, in which algorithms first learn what "good" topics look like and then go on to identify them in new, unseen text. Maybe I'll leave that for a future post!

Well, that’s all I have for now. Thanks for following along!